In [ ]:
%matplotlib inline
import json
import codecs
This notebook contains examples of using web-based APIs (Application Programming Interfaces) to download data from social media platforms. Our examples cover Reddit, Facebook, and Twitter.
For most services, we need to register with the platform in order to use their API. Instructions for the registration processes are outlined in each specific section below.
We use APIs for three reasons: they can be much faster than manually copying and pasting data from the web site, they provide uniform methods for accessing resources (searching for keywords, places, or dates), and using them should conform to the platform's terms of service (important for partnering and publications). Note, however, that each of these platforms enforces strict limits on access: e.g., requests per hour, search history depth, and maximum number of items returned per request.
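Because every platform caps request rates, code that calls an API in a loop usually needs to pause and retry when a limit is hit. The following is a minimal sketch of that idea, not any platform's official method: `call_with_backoff`, `RateLimitError`, and the delay values are hypothetical names and numbers chosen for illustration.

```python
import time


class RateLimitError(Exception):
    """Hypothetical exception standing in for a wrapper's rate-limit error."""


def call_with_backoff(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, doubling the wait after each rate-limit failure."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimitError:
            # Give up once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before retrying
            sleep(base_delay * (2 ** attempt))
```

In practice you would catch whatever exception your API wrapper raises for rate limits in place of the stand-in class used here.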
Reddit's API used to be the easiest to use since it did not require credentials to access data on its subreddit pages. Unfortunately, this process has changed, and developers now need to create a Reddit application on Reddit's app page.
In [ ]:
# For our first piece of code, we need to import the package
# that connects to Reddit. Praw is a thin wrapper around reddit's
# web APIs and works well
import praw
Go to Reddit's app page. Scroll down to "create application", select "web app", and provide a name, description, and URL (which can be anything).
After you press "create app", you will be redirected to a new page with information about your application. Copy the unique identifiers below "web app" and beside "secret". These are your client_id and client_secret values, which you need below.
In [ ]:
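# Now we specify a "unique" user agent for our code.
# This is primarily for identification, and some
# user agents of bad actors might be blocked.
# NOTE: the rest of this call was cut off in the source; the
# client_secret and user_agent values below are placeholders.
redditApi = praw.Reddit(client_id='OdpBKZ1utVJw8Q',
                        client_secret='YOUR_CLIENT_SECRET',
                        user_agent='YOUR_UNIQUE_USER_AGENT')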
In [ ]:
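subreddit = "worldnews"
targetSub = redditApi.subreddit(subreddit)
# The listing call was lost from the source; hot() is an assumed choice
submissions = targetSub.hot(limit=5)
for post in submissions:
    print (post.title)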
In [ ]:
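subreddit = "worldnews"
targetSub = redditApi.subreddit(subreddit)
# The listing call was lost from the source; new() is an assumed choice
submissions = targetSub.new(limit=5)
for post in submissions:
    print (post.title)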
In [ ]:
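# Joining names with "+" queries several subreddits at once
subreddit = "worldnews+aww"
targetSub = redditApi.subreddit(subreddit)
# The listing call was lost from the source; hot() is an assumed choice
submissions = targetSub.hot(limit=5)
for post in submissions:
    print (post.title)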
While you're never supposed to read the comments, for certain live streams or new and rising posts, the comments may provide useful insight into events on the ground or people's sentiment. New posts may not have comments yet though.
Comments are attached to the post title, so for a given submission, you can pull its comments directly.
Note that Reddit returns pages of comments to prevent server overload, so you will not get all comments at once; you must write extra code to fetch more than the top comments returned at first. This pagination is performed using the MoreXYZ objects (e.g., MoreComments or MorePosts).
In [ ]:
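subreddit = "worldnews"
breadthCommentCount = 5
targetSub = redditApi.subreddit(subreddit)
# The listing call was lost from the source; hot() is an assumed choice
submissions = targetSub.hot(limit=1)
for post in submissions:
    print (post.title)
    post.comment_limit = breadthCommentCount
    # Get the top few comments. Iterating the forest directly yields only
    # top-level comments; .list() would flatten every level of the tree.
    for comment in post.comments:
        # Skip the placeholders that stand in for not-yet-loaded comments
        if isinstance(comment, praw.models.MoreComments):
            continue
        # The printed field was lost from the source; author is assumed
        print ("---", comment.author, "---")
        print ("\t", comment.body)
        for reply in comment.replies:
            if isinstance(reply, praw.models.MoreComments):
                continue
            print ("\t", "---", reply.author, "---")
            print ("\t\t", reply.body)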
Reddit has a deep comment structure, and the code above only goes two levels down (top comment and top comment reply). PRAW's website documents its additional functionality, replete with examples.
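To go deeper than two levels, a recursive walk over the reply tree is one option. The sketch below runs on stand-in objects rather than live Reddit data: `walk_comments` is a hypothetical helper, and it assumes only that each comment has a `body` string and an iterable `replies`, with placeholder objects (which lack a `body`) skipped.

```python
from types import SimpleNamespace


def walk_comments(comments, depth=0):
    """Yield (depth, body) pairs for every comment in a nested reply tree."""
    for comment in comments:
        # Skip placeholders (such as PRAW's MoreComments) that carry no body
        if not hasattr(comment, "body"):
            continue
        yield depth, comment.body
        # Recurse into replies, one level deeper
        yield from walk_comments(comment.replies, depth + 1)


# Stand-in data shaped like a small comment thread
thread = [
    SimpleNamespace(body="top", replies=[
        SimpleNamespace(body="reply", replies=[
            SimpleNamespace(body="nested reply", replies=[]),
        ]),
    ]),
]

for depth, body in walk_comments(thread):
    print("\t" * depth + body)
```

With a real submission, the same function could presumably be called on `post.comments`, at the cost of one extra level of output per reply depth.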
Getting access to Facebook's API is slightly easier than Twitter's in that you can go to the Graph API explorer, grab an access token, and immediately start playing around with the API. The access token isn't good forever though, so if you plan on doing long-term analysis or data capture, you'll need to go the full OAuth route and generate tokens using the approved paths.
In [ ]:
# As before, the first thing we do is import the Facebook
# wrapper
import facebook
Facebook has a "Graph API" that lets you explore its social graph. Because of privacy concerns, however, the Graph API is extremely limited in the kinds of data it exposes. For instance, Graph API applications can now only view profiles of people who have already installed that particular application. These restrictions make it quite difficult to see much of Facebook's data.
That being said, Facebook does have many popular public pages (e.g., BBC World News), and articles or messages posted by these public pages are accessible. In addition, many posts and comments made in reply to these public posts are also publicly available for us to explore.
To connect to Facebook's API, we need an access token. Fortunately, for research and testing purposes, getting an access token is very easy.
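In [ ]:
# The contents of this cell were lost from the source. The code below
# expects the token in fbAccessToken; the value here is a placeholder.
fbAccessToken = "PASTE_YOUR_ACCESS_TOKEN_HERE"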
Now we can use the Facebook Graph API with this temporary access token (it does expire after maybe 15 minutes).
In [ ]:
# Connect to the graph API, note we use version 2.5
graph = facebook.GraphAPI(access_token=fbAccessToken, version='2.5')
To get a public page's posts, all you need is the name of the page. Then we can pull the page's feed, and for each post on the page, we can pull its comments and the name of the comment's author. While it's unlikely that we can get more user information than that, author name and sentiment or text analytics can give insight into bursting topics and demographics.
In [ ]:
# What page to look at?
targetPage = "nytimes"
# Other options for pages:
# nytimes, bbc, bbcamerica, bbcafrica, redcross, disaster
maxPosts = 10 # How many posts should we pull?
maxComments = 5 # How many comments for each post?
post = graph.get_object(id=targetPage + '/feed')
# For each post, print its message content and its ID
for v in post["data"][:maxPosts]:
    print ("---")
    # Not every post carries a "message" field, so fall back to ""
    print (v.get("message", ""), v["id"])
    # For each comment on this post, print its number,
    # the name of the author, and the message content
    print ("Comments:")
    comments = graph.get_object(id='%s/comments' % v["id"])
    for (i, comment) in enumerate(comments["data"][:maxComments]):
        print ("\t", i, comment["from"]["name"], comment["message"])
Twitter's API is probably the most useful and flexible but takes several steps to configure. To get access to the API, you first need a Twitter account with a mobile phone number (or any number that can receive text messages) attached to it. Then, we'll use Twitter's developer portal to create an "app" that gives us the keys and tokens (essentially IDs and passwords) we need to connect to the API.
So, in summary, the general steps are: create a Twitter account, attach a mobile number to it, create an app in the developer portal, and copy the app's keys and tokens.
We will then plug these four strings into the code below.
In [ ]:
# For our first piece of code, we need to import the package
# that connects to Twitter. Tweepy is a popular and fully featured
# implementation.
import tweepy
The walkthrough below gives more in-depth instructions for creating a Twitter account and configuring it to generate the information the following code needs.
First, we assume you already have a Twitter account. If not, either create one now or simply follow along. See the attached figures.
Step 1. Create a Twitter account. If you haven't already done this, do so now on Twitter's sign-up page.
Step 2. Set your mobile number. Log into Twitter and go to "Settings." From there, click "Mobile" and fill in an SMS-enabled phone number. You will be asked to confirm this number once it's set, and you'll need to do so before you can create any apps in the next step.
In [ ]:
# Use the strings from your Twitter app webpage to populate these four
# variables. Be sure and put the strings BETWEEN the quotation marks
# to make it a valid Python string.
consumer_key = "IQ03DPOdXz95N3rTm2iMNE8va"
consumer_secret = "0qGHOXVSX1D1ffP7BfpIxqFalLfgVIqpecXQy9SrUVCGkJ8hmo"
access_token = "867193453159096320-6oUq9riQW8UBa6nD3davJ0SUe9MvZrZ"
access_secret = "5zMwq2DVhxBnvjabM5SU2Imkoei3AE6UtdeOQ0tzR9eNU"
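Pasting keys directly into a notebook makes them easy to leak when the notebook is shared. One common alternative, sketched here under the assumption that you have exported the four strings as environment variables, is to read them at run time. The function name and the variable names are hypothetical.

```python
import os


def load_twitter_credentials(env=os.environ):
    """Read the four Twitter strings from environment variables.

    The variable names are hypothetical; use whatever names you exported.
    """
    names = ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
             "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")
    # Fail early with a clear message if anything is missing
    missing = [name for name in names if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

The returned tuple could then be unpacked into `consumer_key`, `consumer_secret`, `access_token`, and `access_secret`.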
In [ ]:
# Now we use the configured authentication information to connect
# to Twitter's API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)
print("Connected to Twitter!")
In [ ]:
# Get tweets from our timeline
public_tweets = api.home_timeline()
# Print the first five authors and tweet texts
for tweet in public_tweets[:5]:
    print (tweet.author.screen_name, "said:", tweet.text)
Now that we're connected, we can search Twitter for specific keywords with relative ease, just as if we were using Twitter's search box. While this search only goes back 7 days and/or 1,500 tweets (whichever is less), it can be powerful if an event you want to track has just started.
Note that you might have to deal with paging if you get lots of data. Twitter will only return you one page of up to 100 tweets at a time.
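As a rough illustration of the paging arithmetic, the sketch below works out how many requests a target number of tweets needs at 100 per page and how many tweets each request should ask for. `plan_pages` is a hypothetical helper, not part of Tweepy.

```python
import math


def plan_pages(total_wanted, page_size=100):
    """Return the per-request counts needed to collect total_wanted items."""
    if total_wanted <= 0:
        return []
    n_requests = math.ceil(total_wanted / page_size)
    # Every request is full-sized except possibly the last one
    sizes = [page_size] * (n_requests - 1)
    sizes.append(total_wanted - page_size * (n_requests - 1))
    return sizes


# 250 tweets need three requests: two full pages and one partial page
print(plan_pages(250))
```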
In [ ]:
# Our search string
queryString = "earthquake"
# Perform the search
matchingTweets = api.search(q=queryString)
print ("Searched for:", queryString)
print ("Number found:", len(matchingTweets))
# For each tweet that matches our query, print the author and text
print ("\nTweets:")
for tweet in matchingTweets:
    print (tweet.author.screen_name, tweet.text)
Twitter's Search API exposes many capabilities, like filtering for media, links, mentions, geolocations, dates, etc. We can access these capabilities directly with the search function.
Twitter's search documentation lists the operators it supports.
In [ ]:
# Let's find only media or links about earthquakes
queryString = "earthquake (filter:media OR filter:links)"
# Perform the search
matchingTweets = api.search(q=queryString)
print ("Searched for:", queryString)
print ("Number found:", len(matchingTweets))
# For each tweet that matches our query, print the author and text
print ("\nTweets:")
for tweet in matchingTweets:
    print (tweet.author.screen_name, tweet.text)
In [ ]:
# Let's find only media or links about earthquakes
queryString = "earthquake (filter:media OR filter:links)"
# How many tweets should we fetch? Upper limit is 1,500
maxToReturn = 100
# Perform the search, and for each tweet that matches our query,
# print the author and text. Cursor handles the paging for us.
print ("\nTweets:")
for status in tweepy.Cursor(api.search, q=queryString).items(maxToReturn):
    print (status.author.screen_name, status.text)
The Tweepy wrapper and the Twitter API are pretty extensive. You can do things like pull the last 3,200 tweets from other users' timelines, find all retweets of your account, get follower lists, search for users matching a query, etc.
More information on Tweepy's capabilities is available on its documentation page.
Other information on the Twitter API is available in Twitter's developer documentation.
Up to this point, all of our work has been retrospective. An event has occurred, and we want to see how Twitter responded over some period of time.
To follow an event in real time, Twitter and Tweepy support Twitter streaming. Streaming is a bit complicated, but it essentially lets us track a set of keywords, places, or users.
To keep things simple, I will provide a simple class and show methods for printing the first few tweets. Larger solutions exist specifically for handling Twitter streaming.
You could take this code though and easily extend it by writing data to a file rather than the console. I've marked where that code could be inserted.
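As one possible shape for that file-writing step, the sketch below stores each tweet as one JSON object per line, which keeps the log easy to read back later. `append_json_line` and `read_json_lines` are hypothetical helpers, and the sample dictionary stands in for a tweet's JSON payload.

```python
import codecs
import json
import os
import tempfile


def append_json_line(file_ptr, record):
    """Write one record as a single line of JSON."""
    file_ptr.write(json.dumps(record) + "\n")


def read_json_lines(path):
    """Load every JSON line in the file back into a list of dicts."""
    with codecs.open(path, "r", "utf8") as in_file:
        return [json.loads(line) for line in in_file if line.strip()]


# Round-trip a stand-in tweet through a temporary log file
log_path = os.path.join(tempfile.mkdtemp(), "tweet_log.json")
with codecs.open(log_path, "w", "utf8") as out_file:
    append_json_line(out_file, {"id": 1, "text": "earthquake reported"})

print(read_json_lines(log_path))
```

Writing with `json.dumps` rather than formatting the dictionary as a string matters here: a Python dictionary's printed form uses single quotes and is not valid JSON.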
In [ ]:
# First, we need to create our own listener for the stream
# that will stop after a few tweets
class LocalStreamListener(tweepy.StreamListener):
    """A simple stream listener that breaks out after X tweets"""
    # Max number of tweets
    maxTweetCount = 10
    # Set current counter
    def __init__(self):
        # Let the parent class set up its own state first
        tweepy.StreamListener.__init__(self)
        self.currentTweetCount = 0
        # For writing out to a file
        self.filePtr = None
    # Create a log file
    def set_log_file(self, newFile):
        # Close any log file that is already open
        if ( self.filePtr ):
            self.filePtr.close()
        self.filePtr = newFile
    # Close log file
    def close_log_file(self):
        if ( self.filePtr ):
            self.filePtr.close()
    # Pass data up to parent then check if we should stop
    def on_data(self, data):
        print (self.currentTweetCount)
        tweepy.StreamListener.on_data(self, data)
        # Returning False tells the stream to disconnect
        if ( self.currentTweetCount >= self.maxTweetCount ):
            return False
    # Increment the number of statuses we've seen
    def on_status(self, status):
        self.currentTweetCount += 1
        # Could write this status to a file instead of to the console
        print (status.text)
        # If we have specified a file, write one JSON object per line
        if ( self.filePtr ):
            self.filePtr.write("%s\n" % json.dumps(status._json))
    # Error handling below here
    def on_exception(self, exc):
        print (exc)
    def on_limit(self, track):
        """Called when a limitation notice arrives"""
        print ("Limit", track)
    def on_error(self, status_code):
        """Called when a non-200 status code is returned"""
        print ("Error:", status_code)
        return False
    def on_timeout(self):
        """Called when stream connection times out"""
        print ("Timeout")
    def on_disconnect(self, notice):
        """Called when twitter sends a disconnect notice"""
        print ("Disconnect:", notice)
    def on_warning(self, notice):
        """Called when a disconnection warning message arrives"""
        print ("Warning:", notice)
Now we set up the stream using the listener above
In [ ]:
listener = LocalStreamListener()
localStream = tweepy.Stream(api.auth, listener)
In [ ]:
# Stream based on keywords
localStream.filter(track=['earthquake', 'disaster'])
In [ ]:
listener = LocalStreamListener()
localStream = tweepy.Stream(api.auth, listener)
# List of screen names to track
screenNames = ['bbcbreaking', 'CNews', 'bbc', 'nytimes']
# Twitter stream uses user IDs instead of names
# so we must convert
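userIds = []
for sn in screenNames:
    user = api.get_user(sn)
    # The stream's follow parameter expects user IDs as strings
    userIds.append(user.id_str)
# Stream based on users
localStream.filter(follow=userIds)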
In [ ]:
listener = LocalStreamListener()
localStream = tweepy.Stream(api.auth, listener)
# Specify coordinates for a bounding box around area of interest
# In this case, we use San Francisco
swCornerLat = 36.8
swCornerLon = -122.75
neCornerLat = 37.8
neCornerLon = -121.75
boxArray = [swCornerLon, swCornerLat, neCornerLon, neCornerLat]
# Say we want to write these tweets to a file
listener.set_log_file(codecs.open("tweet_log.json", "w", "utf8"))
# Stream based on location
localStream.filter(locations=boxArray)
# Close the log file
listener.close_log_file()
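The bounding box above lists longitude before latitude, starting from the southwest corner, which is easy to get backwards. As a small sanity check, the sketch below tests whether a point falls inside such a box. `in_bounding_box` is a hypothetical helper, and the city coordinates are approximate.

```python
def in_bounding_box(lon, lat, box):
    """Return True if (lon, lat) lies inside [swLon, swLat, neLon, neLat]."""
    sw_lon, sw_lat, ne_lon, ne_lat = box
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat


box = [-122.75, 36.8, -121.75, 37.8]
# Downtown San Francisco (approximate) is inside the box
print(in_bounding_box(-122.42, 37.77, box))
# New York City (approximate) is not
print(in_bounding_box(-74.0, 40.7, box))
```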
In [ ]: